Mi proyecto.png

Table of contents

  • Predict Customer Success
  • EDA: Explore active vs inactive customers
    • Setup
    • Basic EDA to customers
    • Basic EDA to customers sales
    • Feautres engineering
    • Active vs inactive behaviors plots
    • Customers characteristics
      • Customers locations
      • Type of customers
      • The customers type influence the MRR?
    • EDA Conclusions
  • Modeling
    • Extra Feautre engineering
      • Product level
        • Create clusters of products with this features
      • Longetivity of customer
      • Product Cluster descriptions
      • CLTV BY PRODUCT CLUSTER and customer
      • Locations clusters
      • CLTV
      • Join product cltv with customers data
    • Predictive models
      • target variables
      • Predict low vs high risk customers
        • classifiers
        • Prepare data for classfication
        • Define and run fun_enc5
        • Run encoding and split data
      • Model development and evaluation based on train-validation-test Cross-Validation
        • Logistic Regression evaluation
        • Random Forest evaluation
        • Suport Vector Machine evaluation
        • DecisionTreeClassifier evaluation
      • grid search:Higher values of class_weight for the minority class
      • Run the models with fewer features
    • Modeling Conclusions
      • Results
      • Future work
      • Disclaimer

Predict Customer Success¶

By: Addison Farber, Alvaro Gonzalez, Brian Bombasi, Stephen Smart

Advisor: Jeremy Morris

Swire Coca-Cola is seeking to improve its ability to predict the success of new restaurants in its market. To achieve this goal, we will analyze census data, consumer reviews, customer attributes, and Swire's sales to these customers. The data will be used to build a predictive model that can estimate the popularity, longevity, and total sales volume of new customers based on historical results. The project will deliver a predictive model, a report on the methodology and results, a software implementation of the model, training and support for stakeholders on how to use and interpret the results, an interactive dashboard to visualize and explore the results of the model, and a live presentation to showcase the findings and explain the methodology and results.

The project will be completed by a group of 4 MSBA students as a castone project and is expected to be completed by mid-April. The goal of this project is to provide Swire Coca-Cola with a data-driven approach to making informed decisions on pricing and funding, maximizing profitability, and minimizing risk. The interactive dashboard and live presentation will be essential tools for stakeholders to understand and make use of the results of the predictive model.

EDA: Explore active vs inactive customers¶

We will create a new column that will classify if a customer was active at the moment of the data pulling. Comparing if in the last 3 months they had a posting or not will let us know if it is active customer.

Setup¶

Import Data and libraries

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import pandas_profiling as pp
import warnings
warnings.filterwarnings('ignore')
from sqlalchemy import create_engine

##read FFSOP_Customer_Data_v2.0.xlsb
df = pd.read_csv(r'C:\Users\Alvaro\Documents\MSBA\FSOP_Customer_Data_v2.0.csv',parse_dates=['ON_BOARDING_DATE'])

##read FFSOP_Sales_Data_v2.0.xlsb
df2 = pd.read_csv(r'C:\Users\Alvaro\Documents\MSBA\FSOP_Sales_Data_v2.0.csv',parse_dates=['MIN_POSTING_DATE','MAX_POSTING_DATE'])

##create tables in postgres 
#df.to_sql('customer',engine,if_exists='replace')
#df2.to_sql('sales',engine,if_exists='replace')

#connect to postgres
engine = create_engine('postgresql://capstone:fFmDDtgjBotGbt5roo8eKfgQt@capstonedb.postgres.database.azure.com:5432/postgres')

#what is in the database
engine.table_names()

# read customer and sales data from postgres
df = pd.read_sql_table('customer',engine)
df2 = pd.read_sql_table('sales',engine)
C:\Users\Alvaro\AppData\Local\Temp\ipykernel_36440\503689038.py:11: DeprecationWarning:

`import pandas_profiling` is going to be deprecated by April 1st. Please use `import ydata_profiling` instead.

Out[ ]:
['product', 'customer', 'sales', 'customer_features']

Basic EDA to customers¶

In [ ]:
pp.ProfileReport(df)
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[ ]:

Basic EDA to customers sales¶

In [ ]:
pp.ProfileReport(df2)
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[ ]:

Feautres engineering¶

  • We will flag active customers when they have posted in the last 3 months.
  • MRR(PRODUCT):Monthly recurrent Revenue by product BLINDED(invoice/months active)
  • TMRR: Monthly recurrent Revenue by customer(invoice/months active)
  • TPM: Transactions Per month(transaction/months active)
  • UNIQUE_PRODUCTS:DIFFERENT PRODUCTS ORDERED BY A CUSTOMER
  • MONTHS ACTIVE

We will create a DF4 wich will compile the most valuble features per customer

In [ ]:
#creat a new column with the max date group by customer_number_blinded
df2['MAX_DATE'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['MAX_POSTING_DATE'].transform('max')
#creat a new column with the min date group by customer_number_blinded
df2['MIN_DATE'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['MIN_POSTING_DATE'].transform('min')
#difference between max and min date 
df2['DIFF_DATE'] = (df2['MAX_DATE'] - df2['MIN_DATE']).dt.days
#if df2['DIFF_DATE'] == 0 then 1 day else df2['DIFF_DATE']
df2['DIFF_DATE'] = np.where(df2['DIFF_DATE'] == 0, 1 , df2['DIFF_DATE'])

#convert the difference to  months and round to 0 decimals
df2['years_diff'] = round(df2['DIFF_DATE']/360,4)
df2['MONTHS_ACTIVE'] = round(df2['DIFF_DATE']/30,4)
# create a new column that tells if the customer is active or not depending if max date > 2022-9-30
df2['ACTIVE'] = np.where(df2['MAX_DATE'] > '2022-9-30', 'Active', 'Inactive')
# create a new column that adds the total of invoices per customer_number_blinded
df2['TOTAL_INVOICES'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['INVOICE_PRICE'].transform('sum')
# crerate a new column MRR that is INVOICE_PRICE/MONTHS_ACTIVE by customer_number_blinded
df2['TMRR'] = df2['TOTAL_INVOICES']/df2['MONTHS_ACTIVE']
# create a new column that counts the unique number of products by customer_number_blinded
df2['UNIQUE_PRODUCTS'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['PRODUCT_SOLD_BLINDED'].transform('nunique')
# SUM NUMBER OF TRANSACTIONS BY CUSTOMER_NUMBER_BLINDED BY years
df2['EXPECTED_TRANS_YEAR'] = df2.groupby(['CUSTOMER_NUMBER_BLINDED'])['NUM_OF_TRANSACTIONS'].transform('sum')/df2['years_diff']
# create new column total GROSS_PROFIT_DEAD_NET by customer_number_blinded
df2['TOTAL_PROFIT'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['GROSS_PROFIT_DEAD_NET'].transform('sum')
# create new column total volume by customer_number_blinded
df2['TOTAL_VOLUME'] = df2.groupby('CUSTOMER_NUMBER_BLINDED')['PHYSICAL_VOLUME'].transform('sum')

#CREATE A NEW DF WITH THE UNIQUE CUSTOMER_NUMBER_BLINDED AND TMRR, MONTHS_ACTIVE, ACTIVE drop duplicates and na values
df3 = df2[['CUSTOMER_NUMBER_BLINDED','TMRR','ACTIVE','UNIQUE_PRODUCTS','EXPECTED_TRANS_YEAR','MAX_DATE','TOTAL_PROFIT','TOTAL_VOLUME','TOTAL_INVOICES','years_diff','MONTHS_ACTIVE']].drop_duplicates().dropna()
#filter mrr > 0 & months active > 0
df3 = df3[(df3['TMRR'] > 0)]


#join df3 with df
df4 = df3.merge(df, on='CUSTOMER_NUMBER_BLINDED', how='left')
#split SALES_OFFICE_DESCRIPTION in sales city and sales state
df4['SALES_CITY'] = df4['SALES_OFFICE_DESCRIPTION'].str.split(',').str[0]
df4['SALES_STATE'] = df4['SALES_OFFICE_DESCRIPTION'].str.split(',').str[1]
#split DELIVERY_PLANT_DESCRIPTION in delivery city and delivery state
df4['DELIVERY_CITY'] = df4['DELIVERY_PLANT_DESCRIPTION'].str.split(',').str[0]
df4['DELIVERY_STATE'] = df4['DELIVERY_PLANT_DESCRIPTION'].str.split(',').str[1]
#split the zipcode in zipcode and zipcode plus 4
df4['ZIPCODE'] = df4['ADDRESS_ZIP_CODE'].str.split('-').str[0]

Active vs inactive behaviors plots¶

In [ ]:
#plot number of customers active and inactive in df3
fig = px.histogram(df3, x="ACTIVE", color="ACTIVE", title='Number of Customers Active and Inactive')
fig.show()

If we consider customers to be inactive when not posting in the last 3 months we have an imbalanced data set where a significant portion of customers are considered active, it is important to consider how this may impact the accuracy of our future predictive model. One potential next step is to try to balance the data set by oversampling the minority class (active customers) or undersampling the majority class (inactive customers). Another approach is to use different evaluation metrics, such as precision and recall, which take into account the false negatives and false positives in the predictions. Additionally, it may be useful to explore different machine learning algorithms that are specifically designed to handle imbalanced data sets, such as decision trees or random forests. Overall, addressing imbalanced data sets is an important consideration in building an effective predictive model that can accurately identify and target active customers.

In [ ]:
#histogram TMRR by active and inactive customers with a bin size of 100 and median line and notation
fig = px.histogram(df3, x="TMRR", color="ACTIVE", title='TMRR by Active and Inactive Customers', histnorm='probability density', marginal='box', hover_data=df3.columns)
fig.show()



#histogram of unique products by active and inactive customers
fig = px.histogram(df3, x="UNIQUE_PRODUCTS", color="ACTIVE", title='Unique Products by Active and Inactive Customers',histnorm='probability density', marginal='box', hover_data=df3.columns)
fig.show()


#histogram of expected transactions by active and inactive customers
fig = px.histogram(df3, x='EXPECTED_TRANS_YEAR', color="ACTIVE", title='Expected transactions per year by Active and Inactive Customers',histnorm='probability density', marginal='box', hover_data=df3.columns)
fig.show()

Customers characteristics¶

Here we will explore where the customers are from and what type of bussiness they have, who is there delivery or sales office.

In [ ]:
#create a  tmrr/max tmrr column
df4['TMRR_MAX'] = df4['TMRR']/df4['TMRR'].max()*10000
#map of customers by location and color by active and inactive
fig = px.scatter_geo(df4, lat="GEO_LATITUDE", lon="GEO_LONGITUDE", color="ACTIVE", size="TMRR_MAX",\
                         hover_name="CUSTOMER_NUMBER_BLINDED", hover_data=["TMRR", "UNIQUE_PRODUCTS"], \
                            title='Map of Customers by Location and Color by Active and Inactive')

fig.update_layout(
        title = 'Map of Customers by Location and Color by Active and Inactive',
        showlegend = True,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
        )
    )

fig.show()

In this map we can see the most inactive customers with big MRR are in the arizona Zone and the highest MRR active customers are in Colorado followed by Seattle and Oregon.

Customers locations¶

In [ ]:
#count of customers by ADDRESS_City and ACTIVE and plot top 20
fig = px.bar(df4.groupby(['ADDRESS_CITY','ACTIVE'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index().sort_values('CUSTOMER_NUMBER_BLINDED', ascending=False).head(100)
             , x="ADDRESS_CITY", y="CUSTOMER_NUMBER_BLINDED", color="ACTIVE", title='Count of Customers by ADDRESS_City and ACTIVE')
fig.show()

#count of customers by sales_State and active
fig = px.bar(df4.groupby(['SALES_STATE','ACTIVE'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index().sort_values('CUSTOMER_NUMBER_BLINDED', ascending=False)
                , x="SALES_STATE", y="CUSTOMER_NUMBER_BLINDED", color="ACTIVE", title='Count of Customers by SALES_STATE')
fig.show() 

#count of customers by delivery_State and ACTIVE
fig = px.bar(df4.groupby(['DELIVERY_STATE','ACTIVE'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index().sort_values('CUSTOMER_NUMBER_BLINDED', ascending=False) 
                , x="DELIVERY_STATE", y="CUSTOMER_NUMBER_BLINDED", color="ACTIVE", title='Count of Customers by DELIVERY_STATE and ACTIVE') 
fig.show()

We can see there is a relation between the columns of sales, delivery and customer location. So in the moment of creating the model we can do a better selection of features.

Type of customers¶

What is the business line of the customers?

In [ ]:
#plot customer trade channel by active and inactive sorted by count
fig = px.histogram(df4, x="CUSTOMER_TRADE_CHANNEL_DESCRIPTION", color="ACTIVE", title='Customer Trade Channel by Active and Inactive Customers',\
                     histnorm='probability density', hover_data=df4.columns, barmode='group',\
                        category_orders={"CUSTOMER_TRADE_CHANNEL_DESCRIPTION": df4['CUSTOMER_TRADE_CHANNEL_DESCRIPTION'].value_counts().index})

fig.show()
#PLOT 'MARKET_DESCRIPTION' BY ACTIVE AND INACTIVE CUSTOMERS
fig = px.histogram(df4, x="COLD_DRINK_CHANNEL_DESCRIPTION", color="ACTIVE", title='Market Description by Active and Inactive Customers',\
                     histnorm='probability density', hover_data=df4.columns, barmode='group',\
                        category_orders={"COLD_DRINK_CHANNEL_DESCRIPTION": df4['COLD_DRINK_CHANNEL_DESCRIPTION'].value_counts().index})

fig.show()

This may show that a customer in Full service , Amusment and recreation can trend to be more inactive.

The customers type influence the MRR?¶

In [ ]:
# plot averag TMRR by market description
fig = px.bar(df4.groupby(['COLD_DRINK_CHANNEL_DESCRIPTION'])['TMRR'].median().reset_index().sort_values('TMRR', ascending=False)
                , x="COLD_DRINK_CHANNEL_DESCRIPTION", y="TMRR", title='Median TMRR by Market Description')
fig.show()

#plot mean TMRR by customer trade channel description
fig = px.bar(df4.groupby(['CUSTOMER_TRADE_CHANNEL_DESCRIPTION'])['TMRR'].median().reset_index().sort_values('TMRR', ascending=False)
                , x="CUSTOMER_TRADE_CHANNEL_DESCRIPTION", y="TMRR", title='Median TMRR by Customer Trade Channel Description')
fig.show()

#plot mean TMRR by sales state
fig = px.bar(df4.groupby(['SALES_STATE'])['TMRR'].median().reset_index().sort_values('TMRR', ascending=False)
                , x="SALES_STATE", y="TMRR", title='Median TMRR by Sales State')
fig.show()

Supermarkets , wholesale and transportation are the type of customers that generate more revenue.

In [ ]:
#plot 'MARKET_DESCRIPTION' BY ACTIVE AND INACTIVE CUSTOMERS
fig = px.histogram(df4, x="COLD_DRINK_CHANNEL_DESCRIPTION", color="ACTIVE", title='Market Description by Active and Inactive Customers',\
                        histnorm='probability density', hover_data=df4.columns, barmode='group',\
                        category_orders={"COLD_DRINK_CHANNEL_DESCRIPTION": df4['COLD_DRINK_CHANNEL_DESCRIPTION'].value_counts().index})

fig.show()
In [ ]:
#plot customer trade channel by active and inactive sorted by count
fig = px.histogram(df4, x="CUSTOMER_TRADE_CHANNEL_DESCRIPTION", color="ACTIVE", title='Customer Trade Channel by Active and Inactive Customers',\
                        histnorm='probability density', hover_data=df4.columns, barmode='group',\
                        category_orders={"CUSTOMER_TRADE_CHANNEL_DESCRIPTION": df4['CUSTOMER_TRADE_CHANNEL_DESCRIPTION'].value_counts().index})

fig.show()
    

EDA Conclusions¶

  • We will need to review with the business the rules of considering an active or non active customer. The rule of 3 months will generate an imbalanced data set wich will requiere different strategies to find the best model.
  • We can predict the MRR acording the type of customer and locations and products ordered
  • the new features created will be significant in the predictive analysis
  • It is needed to find how the product types can influence this targets TMRR and Active.
  • The time as customer dosen't influence the targets.
  • We can scope into restaurants and foods cause even they dont produce a big MRR is the big segment of customers with highest rates of churn.

Modeling¶

In this part of the project, we will model to identify a customer who is worthy of a discount to sign a contract or not. When a customer is at risk of default, the business will lose money because there was no guarantee of a return on investment.

To achieve this goal, we will create clusters of products to analyze the behavior of each customer on each of these product clusters. Additionally, we will cluster the locations to determine if they are in a favorable or unfavorable place, and we will calculate the customer's lifetime value. Cluster models used

  • kmeans for products
  • dbscan for locations

We will use the following models to predict the risk of a customer defaulting on a contract:

  • Logistic Regression
  • Decision Trees
  • Random Forest
  • Support Vector Machines

This is because we are trying to predict a binary outcome (customer will have risk low or high ) and we have a relatively small dataset. Risk is defined by if customer has low Profit life time value or has not orderd in the last 3 months and has les than 3 years.

Extra Feautre engineering¶

We have discovered which are the features and how they intercat between them. Now we will create a DF selecting the features that will be useful to predict the value, status of a customer and if is risky for doing business.

Product level¶

We want to know the types of products and how they are consumed by the customers. We will cluster the 1558 product byits similarity on:

  • average of transactions in this period of time
  • number of customers ordering
  • Average profit by customer
  • average volume per transaction
In [ ]:
#new dataframe with average of NUM_OF_TRANSACTIONS by 'PRODUCT_SOLD_BLINDED'
df5 = df2.groupby(['PRODUCT_SOLD_BLINDED'])['NUM_OF_TRANSACTIONS'].mean().reset_index().sort_values('NUM_OF_TRANSACTIONS', ascending=False)
# inclued count of customers by 'PRODUCT_SOLD_BLINDED'
df5['#Customers'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['CUSTOMER_NUMBER_BLINDED'].count().reset_index()['CUSTOMER_NUMBER_BLINDED']
# Average of 'GROSS_PROFIT_DEAD_NET ' by 'PRODUCT_SOLD_BLINDED'
df5['AVG_GROSS_PROFIT_DEAD_NET'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['GROSS_PROFIT_DEAD_NET'].mean().reset_index()['GROSS_PROFIT_DEAD_NET']
# log average volume by 'PRODUCT_SOLD_BLINDED'
df5['AVG_VOLUME'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['PHYSICAL_VOLUME'].mean().reset_index()['PHYSICAL_VOLUME']
# average Revenue by 'PRODUCT_SOLD_BLINDED'
df5['AVG_INVOICE'] = df2.groupby(['PRODUCT_SOLD_BLINDED'])['INVOICE_PRICE'].mean().reset_index()['INVOICE_PRICE']
#  include 'BEV_CAT_DESC', 'CALORIE_CAT_DESC', 'PACK_TYPE_DESC', 'PACK_SIZE_SALES_UNIT_DESCRIPTION' in df5
df5 = df5.merge(df2[['PRODUCT_SOLD_BLINDED','BEV_CAT_DESC', 'CALORIE_CAT_DESC', 'PACK_TYPE_DESC', 'PACK_SIZE_SALES_UNIT_DESCRIPTION']], on='PRODUCT_SOLD_BLINDED', how='left')
df5 = df5.drop_duplicates()
df5.columns = df5.columns.astype(str)



df5.head()
Out[ ]:
PRODUCT_SOLD_BLINDED NUM_OF_TRANSACTIONS #Customers AVG_GROSS_PROFIT_DEAD_NET AVG_VOLUME AVG_INVOICE BEV_CAT_DESC CALORIE_CAT_DESC PACK_TYPE_DESC PACK_SIZE_SALES_UNIT_DESCRIPTION
0 M084802150628 128.200000 5 2388.546000 243.400000 15020.214000 NaN NaN Paper Cup 16 OZ 1-Ls 1000
5 M014603040787 108.816667 60 1115.548833 323.895833 3633.831000 CORE SPARKLING REGULAR CALORIE Plastic Bottle - Contour 2 LTR 1-Ls 8
65 M003403310963 104.666667 3 -610.623333 95.000000 4550.500000 NaN NaN Paper Cup 32 OZ 32OZ CUP480 PAPER
68 M093708470229 100.966102 59 685.979661 203.237288 2262.545085 CORE SPARKLING REGULAR CALORIE Plastic Bottle - Dimple 2 LTR 1-Ls 8
127 M088201310973 88.500000 2 1061.055000 167.000000 9811.250000 NaN NaN Paper Cup 16 OZ 1-Ls

Create clusters of products with this features¶

Data cleaning: remove duplicates and replaces missing values with zeros. Then it applies a log transformation on some columns that have high skewness.

Feature engineering: frequency encoding to categorical features (i.e., "BEV_CAT_DESC", "CALORIE_CAT_DESC", "PACK_TYPE_DESC", and "PACK_SIZE_SALES_UNIT_DESCRIPTION") to convert them into numerical values, which can be used in the clustering algorithm.

Data scaling: StandardScaler from scikit-learn to scale the data so that all features have the same range of values. This step is essential in the clustering process as it ensures that no feature is more important than another.

Clustering: KMeans clustering, which groups similar products into clusters based on their characteristics. It tries different numbers of clusters and calculates the Within-Cluster-Sum-of-Squares (WCSS) metric to determine the best number of clusters to use.

The output of this isthe list of WCSS values for different numbers of clusters. we can use this output to determine the optimal number of clusters based on the elbow method. The elbow method helps to identify the point where the decrease in WCSS begins to slow down, which is the optimal number of clusters to use for the given dataset.

In [ ]:
# Create clusters of products based on NUM_OF_TRANSACTIONS	#Customers	AVG_GROSS_PROFIT_DEAD_NET	AVG_VOLUME	AVG_INVOICE	BEV_CAT_DESC	CALORIE_CAT_DESC	PACK_TYPE_DESC	PACK_SIZE_SALES_UNIT_DESCRIPTION
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

#drop null values
#df6 = df5.dropna()
#drop duplicates
df6 = df5.drop_duplicates()
# do log transformation if -inf or inf
df6['AVG_VOLUME'] = np.log(df6['AVG_VOLUME'])
df6['AVG_VOLUME'] = df6['AVG_VOLUME'].replace([np.inf, -np.inf], np.nan)
df6['AVG_INVOICE']= np.log(df6['AVG_INVOICE'])
df6['AVG_INVOICE'] = df6['AVG_INVOICE'].replace([np.inf, -np.inf], np.nan)
df6['AVG_GROSS_PROFIT_DEAD_NET'] = np.log(df6['AVG_GROSS_PROFIT_DEAD_NET'])
df6['AVG_GROSS_PROFIT_DEAD_NET'] = df6['AVG_GROSS_PROFIT_DEAD_NET'].replace([np.inf, -np.inf], np.nan)
df6['NUM_OF_TRANSACTIONS'] = np.log(df6['NUM_OF_TRANSACTIONS'])
df6['NUM_OF_TRANSACTIONS'] = df6['NUM_OF_TRANSACTIONS'].replace([np.inf, -np.inf], np.nan)
df6['#Customers'] = np.log(df6['#Customers'])
df6['#Customers'] = df6['#Customers'].replace([np.inf, -np.inf], np.nan)
#drop 'PRODUCT_SOLD_BLINDED'
df6 = df6.drop('PRODUCT_SOLD_BLINDED', axis=1)
#df6 nan values to 0
df6 = df6.fillna(0)
#encode categorical variables
#df6['BEV_CAT_DESC'] = LabelEncoder().fit_transform(df6['BEV_CAT_DESC'])
#df6['CALORIE_CAT_DESC'] = LabelEncoder().fit_transform(df6['CALORIE_CAT_DESC'])
#df6['PACK_TYPE_DESC'] = LabelEncoder().fit_transform(df6['PACK_TYPE_DESC'])
#df6['PACK_SIZE_SALES_UNIT_DESCRIPTION'] = LabelEncoder().fit_transform(df6['PACK_SIZE_SALES_UNIT_DESCRIPTION'])

#frequency encode categorical variables using a function if nan then 0

def frequency_encoding(df, col):
    freq_enc_dict = df[col].value_counts().to_dict()
    df[col] = df[col].map(lambda x: freq_enc_dict.get(x, 0))
    return df



#frequency encode categorical variables
df6 = frequency_encoding(df6, 'BEV_CAT_DESC')
df6 = frequency_encoding(df6, 'CALORIE_CAT_DESC')
df6 = frequency_encoding(df6, 'PACK_TYPE_DESC')
df6 = frequency_encoding(df6, 'PACK_SIZE_SALES_UNIT_DESCRIPTION')

#scale data
scaler = StandardScaler()
df6 = scaler.fit_transform(df6)

#find the optimal number of clusters
wcss = []
for i in range(1, 20):

    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df6)
    wcss.append(kmeans.inertia_)
    
In [ ]:
# plot elobw using plotly
fig = px.line(x=range(1, 20), y=wcss, title='The Elbow Method')

fig.show()

This chart indicates a cluster of 6 will be a good aproach. Is where the explained variation changes

In [ ]:
# create the cluster model
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42)
kmeans.fit(df6)
#df5['Cluster'] = kmeans.labels_.tolist()
df5 = df5.assign(Cluster=kmeans.labels_)

#plot clusters
fig = px.scatter_3d(df5, x='NUM_OF_TRANSACTIONS', y='AVG_GROSS_PROFIT_DEAD_NET', z='AVG_INVOICE', color='Cluster',\
                    title='Clusters of Products', hover_data=df5.columns)
fig.show()

Longetivity of customer¶

Using the activation date on the customers Demographics and the max date posted(created on the previous feature Engineering)

In [ ]:
# creat a new column for the log of the  difference of months between max_date and on_board_date in df4
df4['longetivity'] = (df4['MAX_DATE'] - df4['ON_BOARDING_DATE']).dt.days/365

#histogram of longetivity showing the median

fig = px.histogram(df4, x='longetivity', nbins=100, title='Distribution of Longetivity')
fig.add_vline(x=df4['longetivity'].median(), line_width=3, line_dash="dash", line_color="green")
fig.show()

Product Cluster descriptions¶

In [ ]:
# describe clusters
df7=df5.groupby('Cluster')['NUM_OF_TRANSACTIONS', '#Customers', 'AVG_GROSS_PROFIT_DEAD_NET', 'AVG_VOLUME', 'AVG_INVOICE'].mean()
# of products in each cluster
df7['#Products'] = df5.groupby('Cluster')['PRODUCT_SOLD_BLINDED'].count().reset_index()['PRODUCT_SOLD_BLINDED']
# most frequent of types of products in each cluster
df7['BEV_CAT_DESC'] = df5.groupby('Cluster')['BEV_CAT_DESC'].agg(lambda x:x.value_counts().index[0])
df7['CALORIE_CAT_DESC'] = df5.groupby('Cluster')['CALORIE_CAT_DESC'].agg(lambda x:x.value_counts().index[0])
df7['PACK_TYPE_DESC'] = df5.groupby('Cluster')['PACK_TYPE_DESC'].agg(lambda x:x.value_counts().index[0])
df7['PACK_SIZE_SALES_UNIT_DESCRIPTION'] = df5.groupby('Cluster')['PACK_SIZE_SALES_UNIT_DESCRIPTION'].agg(lambda x:x.value_counts().index[0])
#less frequent of types of products in each cluster
df7['BEV_CAT_DESC2'] = df5.groupby('Cluster')['BEV_CAT_DESC'].agg(lambda x:x.value_counts().index[-1])
df7['CALORIE_CAT_DESC2'] = df5.groupby('Cluster')['CALORIE_CAT_DESC'].agg(lambda x:x.value_counts().index[-1])
df7['PACK_TYPE_DESC2'] = df5.groupby('Cluster')['PACK_TYPE_DESC'].agg(lambda x:x.value_counts().index[-1])
df7['PACK_SIZE_SALES_UNIT_DESCRIPTION2'] = df5.groupby('Cluster')['PACK_SIZE_SALES_UNIT_DESCRIPTION'].agg(lambda x:x.value_counts().index[-1])


df7
Out[ ]:
NUM_OF_TRANSACTIONS #Customers AVG_GROSS_PROFIT_DEAD_NET AVG_VOLUME AVG_INVOICE #Products BEV_CAT_DESC CALORIE_CAT_DESC PACK_TYPE_DESC PACK_SIZE_SALES_UNIT_DESCRIPTION BEV_CAT_DESC2 CALORIE_CAT_DESC2 PACK_TYPE_DESC2 PACK_SIZE_SALES_UNIT_DESCRIPTION2
Cluster
0 16.739553 1081.587786 179.175388 28.252418 534.159858 262 CORE SPARKLING REGULAR CALORIE Aluminum Can 16 OZ 1-Ls 24 SPORTS DRINKS LOW CALORIE Plastic Bottle - Dimple 20 OZ 1-Ls 12
1 3.759816 70.614706 27.228056 6.810729 139.037550 340 ENERGY DRINKS REGULAR CALORIE Aluminum Can 16 OZ 4-Pk 24 JUICES/NECTARS LOW CALORIE Glass Bottle - Dimple 1 LTR 6-Pk 12
2 12.267936 398.000000 1698.628410 136.598598 6906.256432 186 PACKAGED WATER (PLAIN & ENRICHED) LOW CALORIE Paper Cup 32 OZ 1-Ls PACKAGED WATER (PLAIN & ENRICHED) LOW CALORIE Tank 20 OZ 1-Ls 1000
3 19.042104 624.950980 1773.613183 670.512028 6869.723816 204 CORE SPARKLING REGULAR CALORIE Aluminum Can - Sleek 7.5 OZ 6-Pk 24 PACKAGED WATER (PLAIN & ENRICHED) LOW CALORIE Tetra Pak 8.45 OZ 1-Ls 24
4 16.799252 1002.384127 133.051617 35.821840 826.219204 315 SPORTS DRINKS REGULAR CALORIE Plastic Bottle - Other 2.5 GALLON 1-Ls COFFEE LOW CALORIE Plastic Cartridge 12 OZ 12-Pk 12
5 2.494128 28.701195 -2.635169 0.108233 -4.030685 251 CORE SPARKLING REGULAR CALORIE Plastic Bottle - Other 20 OZ 1-Ls 24 DAIRY/SOY BEVERAGES LOW CALORIE Tetra Pak 16 OZ 10-Pk 20
In [ ]:
#send to postgres df5 as table 'product'
df5.to_sql('product', engine, if_exists='replace', index=False)
Out[ ]:
558

CLTV BY PRODUCT CLUSTER and customer¶

In [ ]:
#cjoin df5 with df2 and longetivity
df11= df2[['CUSTOMER_NUMBER_BLINDED', 'PRODUCT_SOLD_BLINDED','MIN_POSTING_DATE', 'MAX_POSTING_DATE','PHYSICAL_VOLUME',
           'INVOICE_PRICE','GROSS_PROFIT_DEAD_NET','NUM_OF_TRANSACTIONS']]\
.merge(df5[['PRODUCT_SOLD_BLINDED', 'Cluster']], on='PRODUCT_SOLD_BLINDED', how='left')\
.merge(df4[['CUSTOMER_NUMBER_BLINDED', 'longetivity']], on='CUSTOMER_NUMBER_BLINDED', how='left')
#creat a new column with the max date group by customer_number_blinded
df11['MAX_DATE'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['MAX_POSTING_DATE'].transform('max')
#creat a new column with the min date group by customer_number_blinded
df11['MIN_DATE'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['MIN_POSTING_DATE'].transform('min')
#difference between max and min date 
df11['DIFF_DATE'] = (df11['MAX_DATE'] - df11['MIN_DATE']).dt.days
#if df2['DIFF_DATE'] == 0 then 1 day else df2['DIFF_DATE']
df11['DIFF_DATE'] = np.where(df11['DIFF_DATE'] == 0, 1 , df11['DIFF_DATE'])
#convert the difference to  months and round to 0 decimals
df11['years_diff'] = round(df11['DIFF_DATE']/360,4)
# create a new column that adds the total of invoices per customer_number_blinded
df11['TOTAL_INVOICES'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['INVOICE_PRICE'].transform('sum')
# SUM NUMBER OF TRANSACTIONS BY CUSTOMER_NUMBER_BLINDED BY years
df11['EXPECTED_TRANS_YEAR'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['NUM_OF_TRANSACTIONS'].transform('sum')/df11['years_diff']
# create new column total GROSS_PROFIT_DEAD_NET by customer_number_blinded
df11['TOTAL_PROFIT'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['GROSS_PROFIT_DEAD_NET'].transform('sum')
# create new column total volume by customer_number_blinded
df11['TOTAL_VOLUME'] = df11.groupby(['CUSTOMER_NUMBER_BLINDED','Cluster'])['PHYSICAL_VOLUME'].transform('sum')


#revclvt 
df11['revclvt'] = df11['TOTAL_INVOICES']/df11['years_diff']*df11['longetivity']
#volclvt
df11['volclvt'] = df11['TOTAL_VOLUME']/df11['years_diff']*df11['longetivity']
#profitclvt
df11['profitclvt'] = df11['TOTAL_PROFIT']/df11['years_diff']*df11['longetivity']
#clusters as string
df11['Cluster'] = df11['Cluster'].astype(str)

#drop duplicates
df11 = df11.drop_duplicates()
df11=df11.drop(['MAX_DATE','MIN_DATE','DIFF_DATE','years_diff','PRODUCT_SOLD_BLINDED','MIN_POSTING_DATE', 'MAX_POSTING_DATE','PHYSICAL_VOLUME',
           'INVOICE_PRICE','GROSS_PROFIT_DEAD_NET','NUM_OF_TRANSACTIONS', 'TOTAL_PROFIT', 'TOTAL_VOLUME','longetivity', 'TOTAL_INVOICES'], axis=1)
df11.columns

#pivot table each column cluster and cutomer_number_blinded in rows and the values no agg

df12 = df11.pivot_table(index=['CUSTOMER_NUMBER_BLINDED'], columns='Cluster', values=['EXPECTED_TRANS_YEAR', 'revclvt','volclvt', 'profitclvt'] ).reset_index()



#multiindex to single index
df12.columns = ['_'.join(col).strip() for col in df12.columns.values]
#rename customer_number_blinded_ as customer_number_blinded
df12.rename(columns={'CUSTOMER_NUMBER_BLINDED_': 'CUSTOMER_NUMBER_BLINDED'}, inplace=True)
df12 = df12.fillna(0)
df12.head()
Out[ ]:
CUSTOMER_NUMBER_BLINDED EXPECTED_TRANS_YEAR_0 EXPECTED_TRANS_YEAR_1 EXPECTED_TRANS_YEAR_2 EXPECTED_TRANS_YEAR_3 EXPECTED_TRANS_YEAR_4 EXPECTED_TRANS_YEAR_5 profitclvt_0 profitclvt_1 profitclvt_2 ... revclvt_2 revclvt_3 revclvt_4 revclvt_5 volclvt_0 volclvt_1 volclvt_2 volclvt_3 volclvt_4 volclvt_5
0 C0001004007230617 51.413882 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0
1 C0001004306830869 1428.571429 0.0 0.000000 357.142857 357.142857 0.0 0.000000 0.0 0.0 ... 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.0
2 C0001005009010198 0.000000 0.0 0.000000 30.393084 0.000000 0.0 0.000000 0.0 0.0 ... 0.0 3975.667177 0.000000 0.0 0.000000 0.0 0.000000 306.961820 0.000000 0.0
3 C0001006902500643 274.142658 0.0 714.285714 0.000000 76.553044 0.0 11400.605126 0.0 0.0 ... 0.0 0.000000 6717.488658 0.0 730.978120 0.0 -82966.731898 0.000000 305.408667 0.0
4 C0001007600510372 357.142857 0.0 7.706281 16.090105 12.109870 0.0 20956.262231 0.0 0.0 ... 0.0 3851.972526 1753.093001 0.0 933.463796 0.0 -96.393355 39.426267 34.528963 0.0

5 rows × 25 columns

Locations clusters¶

We created a cluster of locations to identify the hot zones that are commercial areas and will help to determine if a location is a valuble feature for success

In [ ]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import LabelEncoder
import numpy as np

#drop null values
df9 = df4.dropna()

#drop duplicates
df8 = df9.drop_duplicates(['CUSTOMER_NUMBER_BLINDED'])

#drop 'PRODUCT_SOLD_BLINDED'
df8 = df8.drop('CUSTOMER_NUMBER_BLINDED', axis=1)[['GEO_LONGITUDE', 'GEO_LATITUDE',"ZIPCODE"]]

#df8['ADDRESS_CITY'] = LabelEncoder().fit_transform(df8['ADDRESS_CITY'])

# convert customer_data to a numpy array
X = np.array(df8)

# define a range of hyperparameters to search over
param_grid = {'eps': [0.05 , 0.1, 0.5, 1], 'min_samples': [2,3]}

# create a list to store the silhouette scores for each combination of hyperparameters
scores = []

# loop over each combination of hyperparameters
for params in ParameterGrid(param_grid):
    # fit the DBSCAN model with the current hyperparameters
    dbscan = DBSCAN(**params).fit(X)
    
    # calculate the silhouette score for the current clustering
    labels = dbscan.labels_
    #score = silhouette_score(X, labels)
    score=calinski_harabasz_score(X, labels)
    # store the silhouette score for the current hyperparameters
    scores.append({'params': params, 'score': score})

# find the hyperparameters with the highest silhouette score
best_params = max(scores, key=lambda x: x['score'])['params']

# fit the DBSCAN model with the best hyperparameters
best_dbscan = DBSCAN(**best_params).fit(X)


# get the labels assigned to each data point
labels = dbscan.labels_

df9 = df9.assign(location_Cluster=dbscan.labels_)
df9['location_Cluster'] = df9['location_Cluster'].astype(str)
In [ ]:
#plot clusters in a us map 
import plotly.express as px

fig = px.scatter_mapbox(df9, lat="GEO_LATITUDE", lon="GEO_LONGITUDE", color="location_Cluster", hover_name="COUNTY",\
                        hover_data=["ADDRESS_CITY", "SALES_STATE", "ZIPCODE"], zoom=3, height=600)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
In [ ]:
#drop high correlation columns
df10=df9.drop(['GEO_LATITUDE', 'GEO_LONGITUDE','ADDRESS_ZIP_CODE','DELIVERY_CITY', 'DELIVERY_STATE', 'ZIPCODE','TMRR','SALES_OFFICE_DESCRIPTION',
       'DELIVERY_PLANT_DESCRIPTION','SALES_CITY', 'SALES_STATE','CUSTOMER_TRADE_CHANNEL_DESCRIPTION',
       'CUSTOMER_SUB_TRADE_CHANNEL_DESCRIPTION','CUSTOMER_TRADE_CHANNEL_DESCRIPTION2'], axis=1)

CLTV¶

(Average Monetary Value per transaction) x (Expected Number of transactions per year) x (Expected Tenure in years) . Monetary value can be Revenue or Profit or Volume per transaction Customer Lifetime Value (CLV or CLTV) is the average revenue you can generate from customers over the entire lifetime of their account. In simple terms, it is the money you would make from a customer before churning.

For this we will create 3 features for meassuring the customer live value But our target is to predict the CLTV of a customer in 3 year, So we can have a comparable value of the value in 3 years of each customer. We will change the longetivity of the customer to 3 years and the number of transactions to 3 years.

In [ ]:
#revclvt 
df10['revclvt'] = df10['TOTAL_INVOICES']/df10['years_diff']*df10['longetivity']
#volclvt
df10['volclvt'] = df10['TOTAL_VOLUME']/df10['years_diff']*df10['longetivity']
#profitclvt
df10['profitclvt'] = df10['TOTAL_PROFIT']/df10['years_diff']*df10['longetivity']
#profitclvt 3 years
df10['profitclvt3'] = df10['TOTAL_PROFIT']/df10['years_diff']*3

Join product cltv with customers data¶

In [ ]:
#join df10 and df12

df13 = pd.merge(df10, df12, on='CUSTOMER_NUMBER_BLINDED', how='left')

#drop duplicates
df13 = df13.drop_duplicates()
df13=df13.drop([ 'years_diff','MAX_DATE', 'MONTHS_ACTIVE', 'ON_BOARDING_DATE','ADDRESS_CITY', 'COUNTY','MARKET_DESCRIPTION','BUSINESS_TYPE_EXTENSION_DESCRIPTION'], axis=1)
In [ ]:
df13.head()
Out[ ]:
CUSTOMER_NUMBER_BLINDED ACTIVE UNIQUE_PRODUCTS EXPECTED_TRANS_YEAR TOTAL_PROFIT TOTAL_VOLUME TOTAL_INVOICES CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION COLD_DRINK_CHANNEL_DESCRIPTION TMRR_MAX ... revclvt_2 revclvt_3 revclvt_4 revclvt_5 volclvt_0 volclvt_1 volclvt_2 volclvt_3 volclvt_4 volclvt_5
0 C0051046109640797 Active 55 1211.277410 14833.42 1543.0 43793.98 Other Shopping & Ser RETAIL 61.152265 ... 0.0 4.399003e+02 1.947757e+05 0.0 6087.788118 0.000000 -309.152256 10.594902 8494.805882 0.0
1 C0348074302380406 Active 293 9863.258913 196626.22 65981.0 797839.25 Grocery Shopping WHOLESALE 1114.072689 ... 0.0 1.450065e+06 1.146299e+06 0.0 28136.950882 1291.579023 -3011.338920 113754.912292 109013.238172 0.0
2 C0277089703710223 Active 67 2137.848469 20219.51 5908.0 101043.78 Third Party (Non-Con WHOLESALE 242.641240 ... 0.0 1.992113e+04 5.460199e+04 0.0 1129.758616 41.880744 -20.673599 2125.450872 2859.126914 0.0
3 C0112072503650635 Active 117 3301.570496 4698.02 715.0 16176.86 Other Shopping & Ser RETAIL 80.167835 ... 0.0 1.608157e+02 7.755052e+03 0.0 281.224392 14.213507 -24.901700 6.091503 446.710226 0.0
4 C0305009008200279 Active 16 875.640197 5440.52 406.0 17579.21 Eating & Drinking EATING/DRINKING 24.546947 ... 0.0 3.846574e+04 4.009032e+03 0.0 3025.666455 0.000000 -5163.856583 3608.459352 330.249308 0.0

5 rows × 40 columns

In [ ]:
df13.describe()
Out[ ]:
UNIQUE_PRODUCTS EXPECTED_TRANS_YEAR TOTAL_PROFIT TOTAL_VOLUME TOTAL_INVOICES TMRR_MAX longetivity revclvt volclvt profitclvt ... revclvt_2 revclvt_3 revclvt_4 revclvt_5 volclvt_0 volclvt_1 volclvt_2 volclvt_3 volclvt_4 volclvt_5
count 34997.000000 34997.000000 3.499700e+04 34997.000000 3.499700e+04 34997.000000 34997.000000 3.499700e+04 3.499700e+04 3.499700e+04 ... 3.499700e+04 3.499700e+04 3.499700e+04 3.499700e+04 3.499700e+04 3.499700e+04 3.499700e+04 3.499700e+04 3.499700e+04 34997.000000
mean 22.799326 360.513414 7.014707e+03 858.924912 2.202297e+04 50.412595 8.213564 1.540444e+05 5.791572e+03 4.865545e+04 ... 5.123854e+03 5.401059e+04 5.774288e+04 3.382194e+02 2.704254e+03 3.088674e+02 -3.593143e+03 1.541770e+03 2.844851e+03 21.651934
std 25.451121 700.854277 3.564474e+04 7408.197039 1.309124e+05 236.281483 8.283451 1.181153e+06 5.961992e+04 3.371066e+05 ... 8.055127e+04 4.583457e+05 5.949367e+05 2.190911e+04 3.153476e+04 1.035769e+04 3.105018e+04 2.641386e+04 4.856405e+04 833.568875
min 1.000000 1.858851 -7.950499e+04 -8016.000000 1.000000e-02 0.000333 0.002740 3.868996e-02 -5.790822e+05 -3.016180e+06 ... -1.244932e+06 -1.624556e+06 -1.650220e+05 -1.317070e+06 -1.066145e+04 -7.210959e+04 -2.133591e+06 -2.027397e+04 -4.755382e+03 -46624.266145
25% 9.000000 59.644265 6.274000e+02 31.000000 1.652070e+03 5.452338 2.013699 5.578197e+03 1.061732e+02 2.114215e+03 ... 0.000000e+00 0.000000e+00 1.209513e+03 0.000000e+00 2.806732e+01 0.000000e+00 -4.142466e+02 0.000000e+00 4.011569e+01 0.000000
50% 15.000000 153.246753 1.945760e+03 115.000000 5.093940e+03 12.743426 5.375342 2.101694e+04 4.695053e+02 7.877857e+03 ... 0.000000e+00 3.564559e+03 5.162063e+03 0.000000e+00 1.650972e+02 0.000000e+00 -3.818336e+01 6.393164e+01 1.833032e+02 0.000000
75% 26.000000 368.324125 5.381650e+03 378.000000 1.462452e+04 30.703487 11.353425 6.969400e+04 1.826380e+03 2.582853e+04 ... 0.000000e+00 2.127043e+04 1.991277e+04 0.000000e+00 8.388149e+02 0.000000e+00 0.000000e+00 3.775722e+02 7.699032e+02 0.000000
max 465.000000 23308.408438 2.528506e+06 509144.000000 7.181258e+06 10000.000000 41.441096 9.220280e+07 5.299795e+06 2.108559e+07 ... 6.715323e+06 2.785597e+07 7.658674e+07 2.371820e+06 2.575256e+06 1.774618e+06 2.133855e+05 3.418162e+06 8.036384e+06 45737.769080

8 rows × 35 columns

In [ ]:
#send to postgres df13 as customer_features
df13.to_sql('customer_features', engine, if_exists='replace', index=False)
In [ ]:
#df13 to csv
df13.to_csv('customer_features.csv', index=False)
In [ ]:
df13.columns
Out[ ]:
Index(['CUSTOMER_NUMBER_BLINDED', 'ACTIVE', 'UNIQUE_PRODUCTS',
       'EXPECTED_TRANS_YEAR', 'TOTAL_PROFIT', 'TOTAL_VOLUME', 'TOTAL_INVOICES',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION',
       'COLD_DRINK_CHANNEL_DESCRIPTION', 'TMRR_MAX', 'longetivity',
       'location_Cluster', 'revclvt', 'volclvt', 'profitclvt', 'profitclvt3',
       'EXPECTED_TRANS_YEAR_0', 'EXPECTED_TRANS_YEAR_1',
       'EXPECTED_TRANS_YEAR_2', 'EXPECTED_TRANS_YEAR_3',
       'EXPECTED_TRANS_YEAR_4', 'EXPECTED_TRANS_YEAR_5', 'profitclvt_0',
       'profitclvt_1', 'profitclvt_2', 'profitclvt_3', 'profitclvt_4',
       'profitclvt_5', 'revclvt_0', 'revclvt_1', 'revclvt_2', 'revclvt_3',
       'revclvt_4', 'revclvt_5', 'volclvt_0', 'volclvt_1', 'volclvt_2',
       'volclvt_3', 'volclvt_4', 'volclvt_5'],
      dtype='object')
In [ ]:
#df13 correlation matrix

corr = df13.corr()
#plot the heatmap in plotly 
import plotly.graph_objects as go

fig = go.Figure(data=go.Heatmap(
                     z=corr.values,
                        x=corr.columns,
                        y=corr.columns,
                        colorscale='Viridis'))
fig.show()

Predictive models¶

We will classify the customers in 2 groups: Risky and Non risky. We will use the features created in the previous section to predict the CLTV of the customers and the probability of being risky or not.

target variables¶

First, a diminishing returns curve of profit CLTV 3 is created by sorting the profit CLTV3 values in descending order and calculating the cumulative sum. Outliers are removed from this curve by calculating the first and third quartiles, interquartile range, and lower and upper bounds, and filtering out values outside this range.

Then, the KneeLocator package is used to find the elbow point, which represents the threshold for a good profit CLTV. The KneeLocator package takes in the indices and values of the cumulative sum curve, and uses a heuristic algorithm to find the point of maximum curvature, or the "knee" of the curve. The KneeLocator package also allows specifying the curve type and direction.

Finally, the elbow point is plotted on the cumulative sum curve using Plotly, and the corresponding value for the profit CLTV is printed.

This elbow point is used to create a binary target variable, where 1 represents a customer with a good profit CLTV and 0 represents a customer with a bad profit CLTV.

In [ ]:
#diminishing returns curve of profitclvt
pc0 = df13.sort_values('profitclvt3', ascending=False)['profitclvt3'].reset_index()

#calculate outliers
q1 = pc0['profitclvt3'].quantile(0.25)
q3 = pc0['profitclvt3'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
pc0 = pc0[(pc0['profitclvt3'] > lower_bound) & (pc0['profitclvt3'] < upper_bound)]

#profitclvt cumsum 
pcv = pc0['profitclvt3'].cumsum().reset_index()
#plot profitclvt in plotly

from kneed import KneeLocator

# Use the KneeLocator package to find the elbow point
kl = KneeLocator(pcv.index, pcv['profitclvt3'], curve="concave", direction="decreasing")

# Plot the elbow point
fig = px.line(pcv, x='index', y='profitclvt3', title='Elbow Method For Optimal k')
fig.add_vline(x=kl.elbow, line_width=3, line_dash="dash", line_color="green")
fig.show()

print(pc0[pc0.index == kl.elbow])
print(pc0.profitclvt3.median())
       index  profitclvt3
31248  10174  1071.889428
4369.683257918552

We will mix the risk of the customer also depending if they are inactive and there longetivity was less than 3 years and the CLTV is less than the elbow point.

In [ ]:
#create a risk category based on profitclvt3 and longevitiy and active
df13['risk'] = np.where(((df13['profitclvt3'] > 1071.889428) | ((df13['longetivity'] > 3) & (df13['ACTIVE']== 'Active'))), 'low', 'high')
#median profitclvt3 for each risk category

df13.groupby('risk')['profitclvt3'].median()
# the median of profitclvt3  is 4369.683257918552
#high     636.307959
#low     5669.904268
Out[ ]:
risk
high     636.307959
low     5669.904268
Name: profitclvt3, dtype: float64
In [ ]:
import sklearn as sk

from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier 
from sklearn.tree import export_text

from sklearn.model_selection import train_test_split, cross_validate,\
GridSearchCV, cross_val_score, KFold, ParameterGrid

from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support,\
accuracy_score, recall_score, precision_score, f1_score,\
confusion_matrix, classification_report,precision_recall_curve

# from sklearn.naive_bayes import GaussianNB
# from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
# from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier,\
BaggingClassifier, AdaBoostClassifier

from sklearn.metrics import roc_curve, roc_auc_score,\
 precision_recall_curve, average_precision_score,\
balanced_accuracy_score

from sklearn.model_selection import GridSearchCV, cross_val_score, ParameterGrid

Predict low vs high risk customers¶

In [ ]:
df14=df13
df14.columns
df14=df14.drop(['ACTIVE','TMRR_MAX','profitclvt3','longetivity'],axis=1)
In [ ]:
# Generate y counts and create a vector of y levels
imb_y_counts = df14['risk'].value_counts()
y_levels = imb_y_counts.index
y_counts = df14['risk'].value_counts()
print('In df_imb:','\n')
for i in y_levels:
  print(f'{round(100*y_counts[i]/df14.shape[0],2)} percent is {i}','\n')
In df_imb: 

94.03 percent is low 

5.97 percent is high 

We know that the data set is imbalanced so we will use diff methods to oversample the minority class and undersample the majority class.

classifiers¶

In [ ]:
# create a list of DecisionTreeClassifier estimator objects 
DTC3 = DecisionTreeClassifier(max_depth=3,random_state=42)
DTC4 = DecisionTreeClassifier(max_depth=4,random_state=42)
DTC5 = DecisionTreeClassifier(max_depth=5,random_state=42)
DTC_ent4 = DecisionTreeClassifier(criterion='entropy',max_depth=4,random_state=42)
DTC_list = [DTC3,DTC4,DTC_ent4, DTC5]
DTC_name_list = ['DTC3','DTC4','DTC_ent4','DTC5']

# SVC methods take a long time. Try three different kernels.
# Need to include probability=True for generating areas under ROC and precision-recall curves
SVC_rbf = SVC(probability=True)
SVC_linear = SVC(kernel='linear',probability=True)
SVC_poly = SVC(kernel='poly',degree=2,probability=True)
SVC_list = [SVC_rbf,SVC_linear, SVC_poly]
SVC_name_list = ['SVC_rbf','SVC_linear','SVC_poly']

# Reduce to four LogisticRegression objects
lr_lbfgs = LogisticRegression(random_state=42)
lr_newton = LogisticRegression(solver='newton-cg',random_state=42)
lr_lib = LogisticRegression(solver='liblinear',random_state=42) 
lr_lib_l1 = LogisticRegression(solver='liblinear',penalty = 'l1',random_state=42) 
lr_list = [lr_lbfgs,lr_newton, lr_lib,lr_lib_l1,]
lr_name_list = ['lbfgs_l2','newton_l2','lib_l2','lib_l1',] 
# Reduce to four rf objects. Don't work well for CD_imb2 (extreme class-imbalance)
rf_max7 = RandomForestClassifier(max_depth=7,random_state=42)
rf_entropy_max7 = RandomForestClassifier(criterion='entropy',max_depth=7,random_state=42)
rf_list = [rf_max7, rf_entropy_max7]
rf_name_list = ['rf_max7','rf_entropy_max7']

Prepare data for classfication¶

Split data column-wise into target variable, numeric variables and categorical variables Use pd.get_dummies() to generate dummy variables to encode categorical variables Create encoded model input features by joining dummy variables with numeric features Split row wise the target variable series and the encoded input features into train, validation, test sets

Define and run fun_enc5¶

Encode categorical columns using pd.get.dummies() and keep track of the number of encoded columns per original dataframe's column

In [ ]:
def fun_enc5(df_name,df,y): 
    # separate target variable and predictors
    df_target =df[y]
    df_features = df.drop(y, axis=1)
    # Separate numeric and catogorical predictors
    df_num_features = df_features.select_dtypes(exclude=['object'])
    df_cat_features = df_features.select_dtypes(include=['object'])
    # extract just object type (string) columns from the dataframe.
    # Get the column names of categorical variables as a list to prefix the encoded columns
    df_cat_list = df_cat_features.columns.tolist()

    # Create a transformer object and fit it to cat_features    
    # save dummy variables in a non-sparse matrix
    # This helps produce meaningful feature names 
    enc = OneHotEncoder(dtype=np.int8,sparse=False) 
    arr = enc.fit_transform(df_cat_features)
    # convert array to dataframe
    # use the encoder instance method get_feature_names to prefix our newly encoded 
    # columns with the existing column names
    df_cat_enc = pd.DataFrame(arr, columns=enc.get_feature_names_out(df_cat_list))
    # Added .reset_index(drop=True) to the following data frames 
    # to avoid creating extra rows by pd.concat() in this tutorial
    df_cat_enc = df_cat_enc.reset_index(drop=True)
    df_num_features = df_num_features.reset_index(drop=True)
    df_cat_enc = pd.DataFrame(arr, columns=enc.get_feature_names_out(df_cat_list))

    # Join the sparse dataframe of dummy variables with the data frame of numeric features
    df_features_enc = pd.concat([df_cat_enc,df_num_features],axis=1)
    print(f'Features names after encoding {df_name}\n')
    print(df_features_enc.columns,'\n')
    print(f'{df_name} has {df_features_enc.shape[0]} rows, {df_features_enc.shape[1]} columns\n')
    return df_features_enc, df_target;  

def fun_split_dist(df_name,df_features_enc, df_target, test_pct): 
    # split the predictors and the target data frame into test (test_pct) and 
    # train (1-test_pct) dataframes using the target data frame
    # Specify either train_size or test_size is sufficient. 
    # You can change random_state to seed the random number generator differently
    X_train, X_test, y_train, y_test = \
    train_test_split(df_features_enc, df_target, test_size=test_pct, random_state=42)
    # Generate y_counts in train and test sets 
    train_y_counts = y_train.value_counts()
    test_y_counts = y_test.value_counts()
    # print y distribution in percentage and frequency for train and test sets
    print(f'In {df_name} train set ({(1-test_pct)*100}%):','\n')
    for i in y_levels:
      print(f'{round(100*train_y_counts[i]/X_train.shape[0],2)}% or {train_y_counts[i]} instances belong to class {i}','\n')
    #
    print(f'In {df_name} test set ({(test_pct)*100}%):','\n')
    for i in y_levels:
      print(f'{round(100*test_y_counts[i]/X_test.shape[0],2)}% or {test_y_counts[i]} instances belong to class {i}','\n')
    return X_train, y_train, X_test, y_test;  

Run encoding and split data¶

In [ ]:
def log_transform(df, column):
    df[column] = np.log(df[column])
    df[column] = df[column] .replace([np.inf, -np.inf], np.nan)
    df[column] = df[column].fillna(0)
    return df

#log transform columns
for i in df14.columns:
    if df14[i].dtype == 'int64' or df14[i].dtype == 'float64':
        df14 = log_transform(df14, i)


#drop duplicates
df14 = df14.drop_duplicates()
#drop 'PRODUCT_SOLD_BLINDED'
df14 = df14.drop('CUSTOMER_NUMBER_BLINDED', axis=1)
df14.columns = df14.columns.astype(str)
CD_imb_features_enc, CD_imb_target= fun_enc5('df_imb',df14,'risk')

X_train, y_train, X_test, y_test = fun_split_dist('df_imb',CD_imb_features_enc, CD_imb_target, 0.2)
Features names after encoding df_imb

Index(['CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_At-Work',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Eating & Drinking',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Educational',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Entertainment/Recrea',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Grocery Shopping',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Other Shopping & Ser',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Third Party (Non-Con',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Travel/Transportatio',
       'COLD_DRINK_CHANNEL_DESCRIPTION_AT WORK',
       'COLD_DRINK_CHANNEL_DESCRIPTION_DESTINATION VENUE',
       ...
       'revclvt_2', 'revclvt_3', 'revclvt_4', 'revclvt_5', 'volclvt_0',
       'volclvt_1', 'volclvt_2', 'volclvt_3', 'volclvt_4', 'volclvt_5'],
      dtype='object', length=1402) 

df_imb has 34997 rows, 1402 columns

In df_imb train set (80.0%): 

93.97% or 26309 instances belong to class low 

6.03% or 1688 instances belong to class high 

In df_imb test set (20.0%): 

94.26% or 6598 instances belong to class low 

5.74% or 402 instances belong to class high 

In df_imb train set (80.0%): 

93.97% or 26309 instances belong to class low 

6.03% or 1688 instances belong to class high 

In df_imb test set (20.0%): 

94.26% or 6598 instances belong to class low 

5.74% or 402 instances belong to class high 

Model development and evaluation based on train-validation-test Cross-Validation¶

We will give cost or benefit for predictions to low risk or high risk customer. This values are based on the business rules and the cost of the marketing campaign. and can be changed to optimize the model.

             Predicted low risk |Predicted high risk

Actual low  |Increase of CLVT   |decrease of CLVT (no signing with good customenr) 

Actual high |decrease of CLVT    |No discount offered

We will start giving a cost of how much the median profit value in 3 years decreases by each type of error. and how much it increase for each correct prediction. This values can be changed with cost of the marketing campaign and the business rules. But for this project we will use this values.

        |Predicted high risk    |Predicted low risk

Actual high  |  0                    | -3733    

Actual low|- 1300.22101          | 1300.22101
In [ ]:
cb_factors1 = np.array([[0,-3733],[ -1300,  1300]])
cb_factors1
Out[ ]:
array([[    0, -3733],
       [-1300,  1300]])
In [ ]:
# Generate pred_test to show new metrics with test sets
clf=LogisticRegression(random_state=42)
clf_f = clf.fit(X_train,y_train)
pred_test = clf_f.predict(X_test)
cf = metrics.confusion_matrix(y_test,pred_test)
cf
#print the confusion matrix with class labels and totals

print(pd.DataFrame(cf, index=['true:high', 'true:low'], columns=['pred:high', 'pred:low']))
           pred:high  pred:low
true:high        193       209
true:low          84      6514
In [ ]:
def fun_auprc(y_true, y_pred_prob):
  precision_arr, recall_arr, thresholds = precision_recall_curve(y_true, y_pred_prob, pos_label = 'yes')
  auprc = round(metrics.auc(recall_arr, precision_arr),2)
  return auprc
def fun_avg_net_benefit(y_true, y_pred):
  # create the confusion matrix (2x2 array)
  cf = confusion_matrix(y_true,y_pred)
  # now simply multiply the array with the cost array, sum it up, 
  # then divide the total by the number of instances
  netb_arr = cf* cb_factors1
  return(round(netb_arr.sum()/y_true.shape[0],2)) 
avg_netb = make_scorer(fun_avg_net_benefit)   

auprc = make_scorer(fun_auprc, needs_proba=True)
# fun_avg_net_benefit creates a confusion matrix and multiply it
# with the cost array. sum up the result and return it

custom_scoring5 = {'acc':'accuracy','bacc':'balanced_accuracy',
                   'auroc': 'roc_auc','auprc': auprc,'avg_netb' : avg_netb}
index_names = ['mean','std']                   

def fun_clf_cv_val_test5(clf, clf_name, X_train,y_train,X_test,y_test,kfold):

  # cross_validate an estimator on train data to generate validation performance
  scores = cross_validate(clf,X_train,y_train,cv=kfold,scoring=custom_scoring5)
  # Generate means and stds of all of the custom_scoring5 metrics
  scores_df = pd.DataFrame(scores)
  scores_df.columns= ['v_fit_time','v_test_time','v_acc',\
                    'v_bacc', 'v_auroc', 'v_auprc','v_avg_netb']
  # drop the first two columns
  scores_df = scores_df.drop(['v_fit_time','v_test_time'],axis=1)
  val_mean = round(scores_df.mean(),2)
  val_std = round(scores_df.std(),2)
  # Save val_mean and val_std as rows in val_results_df
  val_results_df = pd.DataFrame(data=(val_mean,val_std), index=index_names)
  val_results_df['clf_name'] = clf_name # helps identifies clf in val_results 
  # Generate test results based on a model trained by the whole train set
  clf_f = clf.fit(X_train,y_train)
  pred_test = clf_f.predict(X_test)
  pred_proba_test_yes = clf_f.predict_proba(X_test)[:,1]
  auroc = roc_auc_score(y_test,pred_proba_test_yes)
  auprc = fun_auprc(y_test,pred_proba_test_yes)
  avg_netb = fun_avg_net_benefit(y_test,pred_test)
  test_results_df= round(pd.DataFrame({'t_acc': accuracy_score(y_test,pred_test),\
                        't_bacc': balanced_accuracy_score(y_test,pred_test),\
                        't_auroc':auroc,\
                        't_auprc':auprc,\
                        't_avg_netb': avg_netb},index=[0]),2)
  test_results_df['clf_name'] = clf_name # helps identifies clf in test_results
  #Now bundle up validation and test results in a single row
  cv_val_test_results_df = pd.DataFrame(
      {'clf_name':[clf_name],'validate_results':[val_results_df],\
      'test_results':[test_results_df]})
  return cv_val_test_results_df

# Define the fun_cv_val_multi_clfs5() function
def fun_cv_val_multi_clfs5(clf_list,clf_name_list,X_train,y_train,X_test,y_test,kfold):
  multi_clf_results_list = []
  for i in range(0,len(clf_list)):
    clf_results_df = fun_clf_cv_val_test5(clf_list[i],clf_name_list[i],X_train,y_train,X_test,y_test,kfold)
    multi_clf_results_list.append(clf_results_df)
  # force pd.concat() to create a good index
  cv_val_multi_clf_results_df = pd.concat(multi_clf_results_list).reset_index(drop=True)
  return cv_val_multi_clf_results_df

def fun_multi_clf_reports5(multi_clf_results_df):
  val_list = []
  test_list = []
  # Get validate_results and test_results for each clf
  for index, model_row in multi_clf_results_df.iterrows():
    val_list.append(model_row['validate_results'])
    test_list.append(model_row['test_results'])
  # end of for loop
  # Create multi_clf_val_report_df and multi_clf_test_report_df with results for a classifier in a single row
  multi_clf_val_df = pd.concat(val_list)
  multi_clf_val_df = multi_clf_val_df.reset_index(drop=False)
  multi_clf_val_df = multi_clf_val_df.rename(columns={"index": "stat_type"})
  #
  multi_clf_val_mean_df = multi_clf_val_df[multi_clf_val_df['stat_type'].str.match('mean')].reset_index(drop=True)
  multi_clf_val_std_df = multi_clf_val_df[multi_clf_val_df['stat_type'].str.match('std')].reset_index(drop=True)
  # 
  multi_clf_test_df = pd.concat(test_list)
  multi_clf_test_df = multi_clf_test_df.reset_index(drop=False)
  #
  return multi_clf_val_mean_df, multi_clf_val_std_df, multi_clf_test_df 

def fun_cv_val_test_comparison5(multi_clf_val_mean_df,multi_clf_val_std_df,multi_clf_test_report_df,comp_name):  
  # show tabular comparisons and sns.catplot() of 
  # validation and test results by metric
  clf_name_df = multi_clf_test_report_df['clf_name']
  for i in range(0,len(metric_map.v_metric)):
    val_metric_mean_df = multi_clf_val_mean_df[metric_map.v_metric.iloc[i]]
    val_metric_std_df = multi_clf_val_std_df[metric_map.v_metric.iloc[i]]
    test_metric_df = multi_clf_test_report_df[metric_map.test_metric.iloc[i]]
    # create wide_df
    wide_df = pd.concat([clf_name_df,test_metric_df, val_metric_mean_df, val_metric_std_df],axis=1)
    # rename the columns
    wide_df.columns = ['clf_name','test_set_result','val_set_mean',
                     'val_set_std']
    print(metric_map.x_label.iloc[i],'from test (left) and from validation in mean and std (right)\n')
    print(wide_df,'\n')
    # create long_df
    long_df = pd.DataFrame(columns=['clf_name',metric_map.x_label.iloc[i],comp_name])
    for r in range(0, len(clf_name_df)):
      long_df = long_df.append({'clf_name':clf_name_df.iloc[r],
                          metric_map.x_label.iloc[i]: test_metric_df.iloc[r], 
                          comp_name:'test_set'},ignore_index=True)
      long_df = long_df.append({'clf_name':clf_name_df.iloc[r],
                          metric_map.x_label.iloc[i]: val_metric_mean_df.iloc[r],
                          comp_name:'val_set'}, ignore_index=True)

      # end of r loop
    # use seaborn's catplot() to draw performance from test-set and val-set 
    # in groups of classifier
    g = sns.catplot(
      data=long_df, kind="bar",
      y="clf_name", x=metric_map.x_label.iloc[i], hue=comp_name, orient='h'
      )
    # g.set(xlim=(0.0, 1.0))
    g.fig.set_figwidth(10)
    # end of i loop and function
In [ ]:
# create a dataframe of val_metric and scorer and col mappings that will be used
# to select result score from the validation results and test results
metric_map = pd.DataFrame(columns=['v_metric','test_metric','x_label'])
metric_map = metric_map.append({'v_metric' : 'v_acc' , 'test_metric' : 't_acc', 'x_label':'Weighted_accuracy'} , ignore_index=True)
metric_map = metric_map.append({'v_metric' : 'v_bacc' , 'test_metric' : 't_bacc', 'x_label':'Balanced_accuracy'} , ignore_index=True)
metric_map = metric_map.append({'v_metric' : 'v_auroc' , 'test_metric' : 't_auroc', 'x_label':'Area_Under_ROC'} , ignore_index=True)
metric_map = metric_map.append({'v_metric' : 'v_auprc' , 'test_metric' : 't_auprc', 'x_label':'Area_Under_PRC'} , ignore_index=True)
metric_map = metric_map.append({'v_metric' : 'v_avg_netb' , 'test_metric' : 't_avg_netb', 'x_label':'Average_Net_Benefits'} , ignore_index=True)
metric_map
Out[ ]:
v_metric test_metric x_label
0 v_acc t_acc Weighted_accuracy
1 v_bacc t_bacc Balanced_accuracy
2 v_auroc t_auroc Area_Under_ROC
3 v_auprc t_auprc Area_Under_PRC
4 v_avg_netb t_avg_netb Average_Net_Benefits
In [ ]:
#def fun_cv_val_multi_clf_all_tasks5():
def fun_cv_val_multi_clf_all_tasks5(clf_list, clf_name_list,X_train,y_train,X_test,y_test,kfold,comp_name):
  multi_clf_results_df = fun_cv_val_multi_clfs5(clf_list, clf_name_list,X_train,y_train,X_test,y_test,kfold)
  multi_clf_val_mean_df, multi_clf_val_std_df, multi_clf_test_report_df = fun_multi_clf_reports5(multi_clf_results_df)
  fun_cv_val_test_comparison5(multi_clf_val_mean_df,multi_clf_val_std_df, multi_clf_test_report_df,comp_name)

Logistic Regression evaluation¶

In [ ]:
#lr_name_list = ['lbfgs_l2','newton_l2','lib_l2','lib_l1',] 

kfold=3
fun_cv_val_multi_clf_all_tasks5(lr_list,lr_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2             0.96          0.96          0.0
1  newton_l2             0.96          0.96          0.0
2     lib_l2             0.96          0.96          0.0
3     lib_l1             0.96          0.96          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2             0.73          0.74         0.02
1  newton_l2             0.73          0.74         0.00
2     lib_l2             0.72          0.73         0.01
3     lib_l1             0.73          0.74         0.01 

Area_Under_ROC from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2             0.97          0.97          0.0
1  newton_l2             0.97          0.97          0.0
2     lib_l2             0.97          0.97          0.0
3     lib_l1             0.97          0.97          0.0 

Area_Under_PRC from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2              0.5           0.5          0.0
1  newton_l2              0.5           0.5          0.0
2     lib_l2              0.5           0.5          0.0
3     lib_l1              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2          1085.17       1074.61         8.64
1  newton_l2          1083.36       1075.56         3.25
2     lib_l2          1080.91       1072.08         2.74
3     lib_l1          1084.54       1077.35         6.18 

Random Forest evaluation¶

In [ ]:
#rf_name_list = ['rf_default','rf_entropy_default','rf_max7','rf_entropy_max7']
kfold=3
fun_cv_val_multi_clf_all_tasks5(rf_list,rf_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default             0.94          0.94          0.0
1  rf_entropy_default             0.94          0.94          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default              0.5           0.5          0.0
1  rf_entropy_default              0.5           0.5          0.0 

Area_Under_ROC from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default             0.95          0.95         0.01
1  rf_entropy_default             0.96          0.96         0.00 

Area_Under_PRC from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default              0.5           0.5          0.0
1  rf_entropy_default              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default          1010.96        996.55          0.3
1  rf_entropy_default          1010.96        996.55          0.3 

Suport Vector Machine evaluation¶

In [ ]:
#SVC_name_list = ['SVC_rbf','SVC_linear','SVC_poly']
kfold=3
fun_cv_val_multi_clf_all_tasks5(SVC_list,SVC_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

     clf_name  test_set_result  val_set_mean  val_set_std
0     SVC_rbf             0.96          0.95          0.0
1  SVC_linear             0.96          0.96          0.0
2    SVC_poly             0.95          0.95          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

     clf_name  test_set_result  val_set_mean  val_set_std
0     SVC_rbf             0.68          0.67         0.01
1  SVC_linear             0.75          0.76         0.00
2    SVC_poly             0.64          0.63         0.00 

Area_Under_ROC from test (left) and from validation in mean and std (right)

     clf_name  test_set_result  val_set_mean  val_set_std
0     SVC_rbf             0.98          0.98          0.0
1  SVC_linear             0.97          0.97          0.0
2    SVC_poly             0.97          0.97          0.0 

Area_Under_PRC from test (left) and from validation in mean and std (right)

     clf_name  test_set_result  val_set_mean  val_set_std
0     SVC_rbf              0.5           0.5          0.0
1  SVC_linear              0.5           0.5          0.0
2    SVC_poly              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

     clf_name  test_set_result  val_set_mean  val_set_std
0     SVC_rbf          1070.09       1055.56         1.61
1  SVC_linear          1080.17       1077.37         0.34
2    SVC_poly          1054.51       1042.64         3.52 

DecisionTreeClassifier evaluation¶

In [ ]:
#DTC_name_list = ['DTC3','DTC4','DTC_ent4','DTC5']
kfold=3
fun_cv_val_multi_clf_all_tasks5(DTC_list,DTC_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.95          0.95          0.0
1      DTC4             0.96          0.95          0.0
2  DTC_ent4             0.95          0.95          0.0
3      DTC5             0.95          0.96          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.70          0.69         0.03
1      DTC4             0.79          0.78         0.03
2  DTC_ent4             0.70          0.74         0.06
3      DTC5             0.77          0.74         0.03 

Area_Under_ROC from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.96          0.96          0.0
1      DTC4             0.97          0.96          0.0
2  DTC_ent4             0.97          0.97          0.0
3      DTC5             0.97          0.97          0.0 

Area_Under_PRC from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3              0.5           0.5          0.0
1      DTC4              0.5           0.5          0.0
2  DTC_ent4              0.5           0.5          0.0
3      DTC5              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3          1058.68       1050.71         7.43
1      DTC4          1084.21       1071.19         6.97
2  DTC_ent4          1061.39       1059.51         8.20
3      DTC5          1078.32       1072.51         8.15 

grid search:Higher values of class_weight for the minority class¶

Use of grid search to find the best parameters for class weight we will give a higer weight to the minority class to see if the model improves and we will evaluate on the best model of each of the previous models

In [ ]:
weight_grid = {'class_weight':[{'high': w} for w in np.linspace(10,100,5)]}
DTC4 = DecisionTreeClassifier(max_depth=4,random_state=42)
SVC_linear = SVC(kernel='linear',probability=True,random_state=42)
rf_entropy_max7 = RandomForestClassifier(criterion='entropy',max_depth=7,random_state=42) 
lr_newton = LogisticRegression(solver='newton-cg',random_state=42)

w1_clf_list=[DTC4,SVC_linear,rf_entropy_max7,lr_newton]
w_clf_name_list=['DTC4','SVC_linear','rf_entropy_max7','lr_newton']
In [ ]:
def fun_clf_gscv_avg_netb(clf, clf_name, X_train,y_train,X_test,y_test,kfold,weight_grid):
  #Based on a single scoring_type, use GridSearchCV() to find the best weight, estimator and score 
  #Use a scoring_type out of the new metrics
  clf_gscv = \
    GridSearchCV(clf, weight_grid, refit=True, cv=kfold,scoring=make_scorer(fun_avg_net_benefit)).fit(X_train, y_train)
  # Generate test results based on clf_gscv.best_estimator_
  
  pred_y = clf_gscv.best_estimator_.predict(X_test)
  test_avg_netb = fun_avg_net_benefit(y_test,pred_y)
  # Now bundle up validation and test results in a single row
  gscv_avg_netb_df = pd.DataFrame(
      {'clf_name':[clf_name],'best_gscv_weight': [clf_gscv.best_params_] , \
       'best_gscv_avg_netb':clf_gscv.best_score_,\
      'test_avg_netb':test_avg_netb})
  return gscv_avg_netb_df

Run to see the results with k fold =3

In [ ]:
fun_clf_gscv_avg_netb(DTC4 ,'DTC4',X_train,y_train,X_test,y_test,3,weight_grid)
Out[ ]:
clf_name best_gscv_weight best_gscv_avg_netb test_avg_netb
0 DTC4 {'class_weight': {'high': 10.0}} 998.536667 1014.5
In [ ]:
fun_clf_gscv_avg_netb(SVC_linear ,'SVC_linear',X_train,y_train,X_test,y_test,3,weight_grid)
Out[ ]:
clf_name best_gscv_weight best_gscv_avg_netb test_avg_netb
0 SVC_linear {'class_weight': {'high': 10.0}} 1057.93 1067.04
In [ ]:
fun_clf_gscv_avg_netb(rf_entropy_max7 ,'Srf_entropy_max7',X_train,y_train,X_test,y_test,3,weight_grid)
Out[ ]:
clf_name best_gscv_weight best_gscv_avg_netb test_avg_netb
0 Srf_entropy_max7 {'class_weight': {'high': 10.0}} 929.16 934.66
In [ ]:
fun_clf_gscv_avg_netb(lr_newton ,'lr_newton',X_train,y_train,X_test,y_test,3,weight_grid)
Out[ ]:
clf_name best_gscv_weight best_gscv_avg_netb test_avg_netb
0 lr_newton {'class_weight': {'high': 10.0}} 1064.243333 1069.84

Run the models with fewer features¶

The data features we used on previous model have some corralation between them. We will run the models with fewer features to see if the model improves, we are selecting mostlyfeatures we will get at the moment of onboarding a new customer.

In [ ]:
df15=df13
df15.drop(['ACTIVE','TOTAL_PROFIT', 'TOTAL_VOLUME', 'TOTAL_INVOICES','TMRR_MAX', 'longetivity','revclvt', 'volclvt', 'profitclvt', 'profitclvt3',
       'EXPECTED_TRANS_YEAR_0', 'EXPECTED_TRANS_YEAR_1',
       'EXPECTED_TRANS_YEAR_2', 'EXPECTED_TRANS_YEAR_3',
       'EXPECTED_TRANS_YEAR_4', 'EXPECTED_TRANS_YEAR_5', 'profitclvt_0',
       'profitclvt_1', 'profitclvt_2', 'profitclvt_3', 'profitclvt_4',
       'profitclvt_5', 'revclvt_0', 'revclvt_1', 'revclvt_2', 'revclvt_3',
       'revclvt_4', 'revclvt_5', 'volclvt_0', 'volclvt_1', 'volclvt_2',
       'volclvt_3', 'volclvt_4', 'volclvt_5',
       ],axis=1,inplace=True)
df15.columns
Out[ ]:
Index(['CUSTOMER_NUMBER_BLINDED', 'UNIQUE_PRODUCTS', 'EXPECTED_TRANS_YEAR',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION',
       'COLD_DRINK_CHANNEL_DESCRIPTION', 'location_Cluster', 'risk'],
      dtype='object')
In [ ]:
#log transform columns
for i in df15.columns:
    if df15[i].dtype == 'int64' or df15[i].dtype == 'float64':
        df15 = log_transform(df15, i)


#drop duplicates
df15 = df15.drop_duplicates()
#drop 'PRODUCT_SOLD_BLINDED'
df15 = df15.drop('CUSTOMER_NUMBER_BLINDED', axis=1)
df15.columns = df15.columns.astype(str)
CD_imb_features_enc, CD_imb_target= fun_enc5('df_imb',df15,'risk')

X_train, y_train, X_test, y_test = fun_split_dist('df_imb',CD_imb_features_enc, CD_imb_target, 0.2)
Features names after encoding df_imb

Index(['CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_At-Work',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Eating & Drinking',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Educational',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Entertainment/Recrea',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Grocery Shopping',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Other Shopping & Ser',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Third Party (Non-Con',
       'CUSTOMER_ACTIVITY_CLUSTER_DESCRIPTION_Travel/Transportatio',
       'COLD_DRINK_CHANNEL_DESCRIPTION_AT WORK',
       'COLD_DRINK_CHANNEL_DESCRIPTION_DESTINATION VENUE',
       ...
       'location_Cluster_992', 'location_Cluster_993', 'location_Cluster_994',
       'location_Cluster_995', 'location_Cluster_996', 'location_Cluster_997',
       'location_Cluster_998', 'location_Cluster_999', 'UNIQUE_PRODUCTS',
       'EXPECTED_TRANS_YEAR'],
      dtype='object', length=1372) 

df_imb has 34997 rows, 1372 columns

In df_imb train set (80.0%): 

93.97% or 26309 instances belong to class low 

6.03% or 1688 instances belong to class high 

In df_imb test set (20.0%): 

94.26% or 6598 instances belong to class low 

5.74% or 402 instances belong to class high 

In [ ]:
#### Logistic Regression evaluation
#lr_name_list = ['lbfgs_l2','newton_l2','lib_l2','lib_l1',] 

kfold=3
fun_cv_val_multi_clf_all_tasks5(lr_list,lr_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2             0.94          0.94          0.0
1  newton_l2             0.94          0.94          0.0
2     lib_l2             0.94          0.94          0.0
3     lib_l1             0.94          0.94          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2             0.59          0.58         0.01
1  newton_l2             0.59          0.58         0.01
2     lib_l2             0.59          0.58         0.01
3     lib_l1             0.59          0.58         0.01 

Area_Under_ROC from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2              0.9          0.89         0.00
1  newton_l2              0.9          0.89         0.00
2     lib_l2              0.9          0.89         0.00
3     lib_l1              0.9          0.89         0.01 

Area_Under_PRC from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2              0.5           0.5          0.0
1  newton_l2              0.5           0.5          0.0
2     lib_l2              0.5           0.5          0.0
3     lib_l1              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

    clf_name  test_set_result  val_set_mean  val_set_std
0   lbfgs_l2          1025.61       1010.11         5.14
1  newton_l2          1024.70       1009.79         5.28
2     lib_l2          1024.01       1010.36         5.00
3     lib_l1          1024.75       1011.66         5.48 

In [ ]:
#### Random Forest evaluation
#rf_name_list = ['rf_default','rf_entropy_default','rf_max7','rf_entropy_max7']
kfold=3
fun_cv_val_multi_clf_all_tasks5(rf_list,rf_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default             0.94          0.94          0.0
1  rf_entropy_default             0.94          0.94          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default              0.5           0.5          0.0
1  rf_entropy_default              0.5           0.5          0.0 

Area_Under_ROC from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default             0.85          0.86         0.02
1  rf_entropy_default             0.89          0.88         0.01 

Area_Under_PRC from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default              0.5           0.5          0.0
1  rf_entropy_default              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

             clf_name  test_set_result  val_set_mean  val_set_std
0          rf_default          1010.96        996.55          0.3
1  rf_entropy_default          1010.96        996.55          0.3 

In [ ]:
#### DecisionTreeClassifier evaluation
#DTC_name_list = ['DTC3','DTC4','DTC_ent4','DTC5']
kfold=3
fun_cv_val_multi_clf_all_tasks5(DTC_list,DTC_name_list,X_train,y_train,X_test,y_test,kfold,'data_source')
Weighted_accuracy from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.95          0.94          0.0
1      DTC4             0.94          0.94          0.0
2  DTC_ent4             0.94          0.94          0.0
3      DTC5             0.94          0.94          0.0 

Balanced_accuracy from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.55          0.57         0.04
1      DTC4             0.55          0.58         0.04
2  DTC_ent4             0.55          0.55         0.01
3      DTC5             0.56          0.60         0.04 

Area_Under_ROC from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3             0.88          0.85         0.01
1      DTC4             0.88          0.86         0.01
2  DTC_ent4             0.90          0.89         0.01
3      DTC5             0.89          0.86         0.01 

Area_Under_PRC from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3              0.5           0.5          0.0
1      DTC4              0.5           0.5          0.0
2  DTC_ent4              0.5           0.5          0.0
3      DTC5              0.5           0.5          0.0 

Average_Net_Benefits from test (left) and from validation in mean and std (right)

   clf_name  test_set_result  val_set_mean  val_set_std
0      DTC3          1024.77       1008.92         0.99
1      DTC4          1022.70       1009.81         0.64
2  DTC_ent4          1023.01       1008.71         0.58
3      DTC5          1024.46       1009.20         2.69 

Modeling Conclusions¶

After conducting exploratory data analysis and creating additional features from the data provided by Swire, we were able to develop a model that predicts the risk of doing business with a customer. This risk variable was calculated using the customer's CLTV over a period of 3 years and the probability of being inactive over the next 3 years.

To calculate the customer's lifetime value, we used their profit over the last 3 years and the expected number of transactions per year over the next 3 years. We then created a feature that predicts the customer's lifetime value in 3 years.

We defined a customer as risky if they were inactive (no orders in the last 3 months) and had been a customer for less than 3 years, and if their CLTV over the next 3 years was less than the elbow point of the diminishing returns curve.

We also used Kmeans to cluster products and predict the customer's CLTV for each product type. This information was used as a feature to predict the risk and to monitor transaction behavior for any potential issues.

We used four models (Logistic Regression, Random Forest, Support Vector Machine, and DecisionTreeClassifier) to predict customer risk, evaluating each model with different parameters. We developed a cost matrix to evaluate the model based on the cost of marketing campaigns and business rules. We also used grid search to find the best parameters for class weight to give a higher weight to the minority class and determine if the model would improve.

We performed feature selection, focusing mainly on the features we would obtain when onboarding a new customer, and removed highly correlated features to improve the model's performance.

Results¶

The models were evaluated on several metrics, and the data was split into train, validation, and test sets, with a cross-validation of 3 folds. After reviewing the different measures, we will choose the model with the highest average net benefit. We analyzed the confusion matrix and cost matrix to identify the types of errors the model was making and the cost of each error.

                |Predicted high risk    |Predicted low risk

Actual high  |  0                        | -3733    

Actual low  | -1300                       | 1300

The cost is related to how it affects the customer's lifetime profit over 3 years. The cost of not signing a good customer is 3733, while the cost of signing a bad customer is 1300. This decision was made based on the median of a high-risk customer, which is 1300 below the median CLTV for 3 years. and a good prediction of a low risk will increase the CLTV by 1300.

The best model was a SVC linear when using all the features of transaction history and products CLVTS. The model had an average net benefit of 1087. in test and 1077.37 on validation. This types of models take clos to 90 minuts to run.

When the minority class was weighted and run using a grid search for the best weight was 10 for all models. And the best model was newton logistic regression with a score of 1064.

And finally we run the models to see there performance with fewer features. The best model was the logistic regression by not high diference from other models. This evaluation shsws that reducing to demographic and characteristics of the customers will not higly affect the model performance and will be easier to implement.

Future work¶

We can keep including more features to the model and run models with different scenarios that drops features to see there performance. We can also run the models with different cost matrix to see how the model performs with different cost of the marketing campaign and business rules. We will create a dashboard containing this new features that will be used to understand the customer base and behavior guiding on how to increase CLTV and reduce the risk of doing business with a customer.

Disclaimer¶

The functions and modeling implemented in this project where done for educational purposes in advance data mining. By profesor Olivia Sheng. The data used in this project was provided by Swire Beverages. The results of this project are not to be used for any other purpose than educational.